Task¶

We will be investigating the same dataset as in HW1: spambase.csv from the OpenML-100 suite. This dataset concerns emails, of which some (~39%) were classified as spam, whereas the rest were work and personal emails. We will train a few Random Forest classifiers and an XGBoost model and compare their permutation-based variable importances. We will also look at and compare other types of variable importance, namely TreeSHAP and scikit-learn's built-in feature_importances_ (mean decrease in impurity).

Permutation-based Variable Importance¶

Basic RF model (depth=8, variables=0.3)¶

chart1

Shallower RF model (depth=3, variables=0.3)¶

chart2

Comment:

  1. The top 3 most important variables remain the same (only their ordering switches)
  2. The shallower model increases the relative importance of the top 3 variables compared to the others. This is to be expected: a shallow tree has fewer splits at which it can choose a variable, and because it almost always picks one of the top 3 variables, the remaining variables get fewer chances to be chosen
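The permutation-based importances discussed above can be sketched by hand: shuffle one column of the test set, re-score the model, and record the performance drop. A minimal, self-contained sketch on toy data (the estimator settings mirror the RF above, the dataset is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for spambase (shapes and names are illustrative).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=50, max_depth=8, max_features=0.3, random_state=1)
rf.fit(X_tr, y_tr)
baseline = accuracy_score(y_te, rf.predict(X_te))

rng = np.random.default_rng(0)
drops = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])  # break the link between feature j and the target
    drops.append(baseline - accuracy_score(y_te, rf.predict(X_perm)))

# A larger drop means the model relied more on that feature.
```

dalex's model_parts does essentially this (averaged over several permutations and measured in dropout loss rather than accuracy).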

Fewer-variables RF model (depth=8, variables=0.1)¶

chart3

Comment:

  1. The top 3 most important variables are the same, in the same order
  2. The relative importance of the top 3 variables compared to the others is slightly smaller. This can be explained by the smaller 0.1 fraction of candidate variables per split, which more often gives the less important variables a chance to make a split in the tree
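A quick back-of-the-envelope check of point 2: with max_features=f, each split only sees a random subset of ⌊f·p⌋ of the p=55 features, so the probability that none of the top 3 variables is even available at a split grows as f shrinks. A hedged sketch (the hypergeometric argument, not anything computed in the notebook):

```python
from math import comb

p = 55    # features in spambase after dropping the index column and TARGET
top = 3   # the top 3 variables

def prob_no_top(frac):
    """Probability a random candidate subset of size floor(frac*p) contains none of the top 3."""
    k = max(1, int(frac * p))
    return comb(p - top, k) / comb(p, k)

p_03 = prob_no_top(0.3)  # max_features=0.3
p_01 = prob_no_top(0.1)  # max_features=0.1
# p_01 > p_03: with fewer candidates per split it is more likely that none of
# the top 3 is available, so weaker variables get picked more often.
```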

XGBoost model¶

chart4

Comment:

  1. XGBoost has a completely different importance structure than the random forest
  2. XGBoost focuses more on niche variables, whereas the RF focuses on the most frequently occurring ones. This is expected: by upweighting poorly predicted observations in each boosting iteration, XGBoost learns to pay attention to less frequent but informative variables
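The claim of a "completely different importance structure" could be quantified with a rank correlation between the two dropout-loss vectors. A hedged sketch (the vectors below are illustrative placeholders, not the actual values from the charts):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative (made-up) dropout-loss values for the same six variables,
# one vector from the RF explainer and one from the XGBoost explainer.
rf_imp  = np.array([0.026, 0.019, 0.013, 0.013, 0.010, 0.009])
xgb_imp = np.array([0.012, 0.010, 0.022, 0.005, 0.017, 0.006])

rho, _ = spearmanr(rf_imp, xgb_imp)
# rho near 1 = both models rank variables the same way;
# rho near 0 = the "completely different structure" observed above.
```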

Other importance types¶

Below is a table showing the variable importances of our original RF model (depth=8, variables=0.3), but using scikit-learn's built-in feature_importances_ (mean decrease in impurity):

Variable                     Importance
char_freq_%21                0.191269
char_freq_%24                0.171319
word_freq_remove             0.134845
word_freq_free               0.082156
capital_run_length_average   0.073821
word_freq_your               0.058051
word_freq_hp                 0.052018
word_freq_money              0.036021
word_freq_our                0.024775
word_freq_000                0.019361

Comment:

  1. The top 3 most important variables remain the same (only their ordering switches)
  2. Overall the values are comparable to the permutation-based ones
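Both importance types can be pulled from the same fitted model; a minimal sketch on toy data showing how they are obtained side by side (note MDI is normalized to sum to 1 over features, while permutation importance is not, so only the rankings are directly comparable):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=6, n_informative=2, random_state=0)
rf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=1).fit(X, y)

mdi = rf.feature_importances_  # mean decrease in impurity, sums to 1
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0).importances_mean

# Both should rank the informative features highest, though the scales differ.
```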

Shap¶

Below is a chart showing the variable importances of our original RF model (depth=8, variables=0.3), but using TreeSHAP:

chart5

Comment:

  1. The top 3 most important variables remain the same (only their ordering switches)
  2. A direct comparison is rather hard, as SHAP explains single instances, whereas permutation variable importance looks at the whole model
  3. Using a beeswarm plot, we can additionally see the direction of each variable's contribution in TreeSHAP and how the contributions are distributed across instances
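Per-instance SHAP values can still be collapsed into a global importance by taking the mean absolute contribution per feature, which is what makes the comparison in point 2 possible at all. A sketch using a synthetic stand-in for the (461, 55) values array produced above:

```python
import numpy as np

# Synthetic stand-in for shap_values[:, :, 0]: rows are instances, columns features.
rng = np.random.default_rng(0)
shap_vals = rng.normal(size=(461, 55))

# Global importance: mean |contribution| per feature, then rank descending.
global_imp = np.abs(shap_vals).mean(axis=0)
top10 = np.argsort(global_imp)[::-1][:10]
```

This mean-|SHAP| ranking is also what shap's bar plot shows by default.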

Appendix¶

Data preparation¶

In [1]:
import numpy as np
import pandas as pd
import dalex as dx
import lime

spambase = pd.read_csv("spambase.csv")
In [2]:
df = spambase.drop(spambase.columns[0], axis=1)  # drop the first column, which is just a row index
In [3]:
df.describe()
Out[3]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_table word_freq_conference char_freq_%3B char_freq_%28 char_freq_%5B char_freq_%21 char_freq_%24 char_freq_%23 capital_run_length_average TARGET
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.104553 0.213015 0.280656 0.065425 0.312223 0.095901 0.114208 0.105295 0.090067 0.239413 ... 0.005444 0.031869 0.038575 0.139030 0.016976 0.269071 0.075811 0.044238 5.191515 0.394045
std 0.305358 1.290575 0.504143 1.395151 0.672513 0.273824 0.391441 0.401071 0.278616 0.644755 ... 0.076274 0.285735 0.243471 0.270355 0.109394 0.815672 0.245882 0.429342 31.729449 0.488698
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.065000 0.000000 0.000000 0.000000 0.000000 2.276000 0.000000
75% 0.000000 0.000000 0.420000 0.000000 0.380000 0.000000 0.000000 0.000000 0.000000 0.160000 ... 0.000000 0.000000 0.000000 0.188000 0.000000 0.315000 0.052000 0.000000 3.706000 1.000000
max 4.540000 14.280000 5.100000 42.810000 10.000000 5.880000 7.270000 11.110000 5.260000 18.180000 ... 2.170000 10.000000 4.385000 9.752000 4.081000 32.478000 6.003000 19.829000 1102.500000 1.000000

8 rows × 56 columns

In [4]:
X = df.loc[:, df.columns != 'TARGET']
In [5]:
y = df.loc[:, df.columns == 'TARGET']
In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)
In [9]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits = 5)
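The KFold object above is not used in the cells shown; a hedged sketch of how it could plug into cross_validate (toy data and an illustrative estimator, not taken from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
kf = KFold(n_splits=5)

scores = cross_validate(RandomForestClassifier(n_estimators=20, random_state=1),
                        X, y, cv=kf, scoring="accuracy")
mean_acc = scores["test_score"].mean()  # one score per fold, averaged
```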

Random Forest¶

In [10]:
from sklearn.ensemble import RandomForestClassifier
In [11]:
RF_final = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.3, random_state = 1).fit(X_train, y_train.values.ravel())  # ravel() avoids the column-vector DataConversionWarning
print("Train accuracy: ", accuracy_score(y_train, RF_final.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, RF_final.predict(X_test)))
Train accuracy:  0.9553140096618358
Test accuracy:  0.9609544468546638

Dalex variable importance¶

In [13]:
RFexplainer = dx.Explainer(RF_final, X_test, y_test)
RFexplainer.model_performance()
Preparation of a new explainer is initiated

  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000137E7081940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0169, mean = 0.416, max = 0.991
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.671, mean = -0.00337, max = 0.859
  -> model_info        : package sklearn

A new explainer has been created!
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but RandomForestClassifier was fitted with feature names
  warnings.warn(
Out[13]:
recall precision f1 accuracy auc
RandomForestClassifier 0.926316 0.977778 0.951351 0.960954 0.993397
In [14]:
pvi = RFexplainer.model_parts(random_state=0)
In [15]:
pvi.result
Out[15]:
variable dropout_loss label
0 word_freq_report 0.006545 RandomForestClassifier
1 char_freq_%23 0.006562 RandomForestClassifier
2 word_freq_data 0.006586 RandomForestClassifier
3 word_freq_conference 0.006588 RandomForestClassifier
4 word_freq_85 0.006592 RandomForestClassifier
5 word_freq_cs 0.006597 RandomForestClassifier
6 char_freq_%5B 0.006599 RandomForestClassifier
7 word_freq_table 0.006599 RandomForestClassifier
8 word_freq_receive 0.006601 RandomForestClassifier
9 word_freq_857 0.006603 RandomForestClassifier
10 _full_model_ 0.006603 RandomForestClassifier
11 word_freq_telnet 0.006611 RandomForestClassifier
12 word_freq_direct 0.006613 RandomForestClassifier
13 word_freq_all 0.006617 RandomForestClassifier
14 word_freq_3d 0.006617 RandomForestClassifier
15 word_freq_415 0.006621 RandomForestClassifier
16 word_freq_original 0.006623 RandomForestClassifier
17 word_freq_pm 0.006625 RandomForestClassifier
18 word_freq_credit 0.006627 RandomForestClassifier
19 word_freq_project 0.006630 RandomForestClassifier
20 word_freq_addresses 0.006638 RandomForestClassifier
21 word_freq_make 0.006640 RandomForestClassifier
22 word_freq_labs 0.006648 RandomForestClassifier
23 word_freq_parts 0.006658 RandomForestClassifier
24 char_freq_%28 0.006660 RandomForestClassifier
25 word_freq_over 0.006665 RandomForestClassifier
26 word_freq_address 0.006687 RandomForestClassifier
27 word_freq_lab 0.006691 RandomForestClassifier
28 word_freq_mail 0.006698 RandomForestClassifier
29 word_freq_technology 0.006708 RandomForestClassifier
30 word_freq_email 0.006724 RandomForestClassifier
31 word_freq_people 0.006729 RandomForestClassifier
32 word_freq_order 0.006733 RandomForestClassifier
33 word_freq_1999 0.006743 RandomForestClassifier
34 word_freq_internet 0.006764 RandomForestClassifier
35 char_freq_%3B 0.006799 RandomForestClassifier
36 word_freq_will 0.006825 RandomForestClassifier
37 word_freq_re 0.006850 RandomForestClassifier
38 word_freq_hpl 0.006918 RandomForestClassifier
39 word_freq_meeting 0.006937 RandomForestClassifier
40 word_freq_you 0.006999 RandomForestClassifier
41 word_freq_650 0.007009 RandomForestClassifier
42 word_freq_our 0.007225 RandomForestClassifier
43 word_freq_font 0.007398 RandomForestClassifier
44 word_freq_business 0.007625 RandomForestClassifier
45 word_freq_your 0.008427 RandomForestClassifier
46 word_freq_000 0.008924 RandomForestClassifier
47 word_freq_money 0.009429 RandomForestClassifier
48 word_freq_edu 0.009889 RandomForestClassifier
49 word_freq_george 0.010276 RandomForestClassifier
50 word_freq_free 0.012529 RandomForestClassifier
51 capital_run_length_average 0.012735 RandomForestClassifier
52 word_freq_hp 0.013267 RandomForestClassifier
53 char_freq_%24 0.018916 RandomForestClassifier
54 word_freq_remove 0.025955 RandomForestClassifier
55 char_freq_%21 0.026308 RandomForestClassifier
56 _baseline_ 0.489602 RandomForestClassifier
In [16]:
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
In [35]:
featureimp = pd.DataFrame(data = {"Variable": RF_final.feature_names_in_, "Importance": RF_final.feature_importances_})
featureimp
Out[35]:
Variable Importance
0 word_freq_make 0.001165
1 word_freq_address 0.001765
2 word_freq_all 0.002796
3 word_freq_3d 0.000314
4 word_freq_our 0.024775
5 word_freq_over 0.003399
6 word_freq_remove 0.134845
7 word_freq_internet 0.007125
8 word_freq_order 0.002147
9 word_freq_mail 0.002673
10 word_freq_receive 0.004296
11 word_freq_will 0.005077
12 word_freq_people 0.001294
13 word_freq_report 0.000896
14 word_freq_addresses 0.000538
15 word_freq_free 0.082156
16 word_freq_business 0.007490
17 word_freq_email 0.003160
18 word_freq_you 0.011135
19 word_freq_credit 0.002346
20 word_freq_your 0.058051
21 word_freq_font 0.001764
22 word_freq_000 0.019361
23 word_freq_money 0.036021
24 word_freq_hp 0.052018
25 word_freq_hpl 0.011815
26 word_freq_george 0.019223
27 word_freq_650 0.002613
28 word_freq_lab 0.000827
29 word_freq_labs 0.001923
30 word_freq_telnet 0.001212
31 word_freq_857 0.000208
32 word_freq_data 0.001240
33 word_freq_415 0.000344
34 word_freq_85 0.001625
35 word_freq_technology 0.001655
36 word_freq_1999 0.006814
37 word_freq_parts 0.000221
38 word_freq_pm 0.002059
39 word_freq_direct 0.000368
40 word_freq_cs 0.000530
41 word_freq_meeting 0.007260
42 word_freq_original 0.000916
43 word_freq_project 0.001113
44 word_freq_re 0.006195
45 word_freq_edu 0.017709
46 word_freq_table 0.000126
47 word_freq_conference 0.000787
48 char_freq_%3B 0.001909
49 char_freq_%28 0.006084
50 char_freq_%5B 0.000869
51 char_freq_%21 0.191269
52 char_freq_%24 0.171319
53 char_freq_%23 0.001338
54 capital_run_length_average 0.073821
In [36]:
import shap
shapExplainer = shap.TreeExplainer(RF_final)
explanation = shapExplainer(X_test)
shap_values = explanation.values
In [41]:
shap_values.shape
Out[41]:
(461, 55, 2)
In [44]:
shap.plots.beeswarm(explanation[:,:,0])

Other ablations of RF¶

RF with less depth¶

In [25]:
RF_2 = RandomForestClassifier(n_estimators=200, max_depth = 3, max_features = 0.3, random_state = 1).fit(X_train, y_train.values.ravel())
print("Train accuracy: ", accuracy_score(y_train, RF_2.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, RF_2.predict(X_test)))
In [26]:
RF2explainer = dx.Explainer(RF_2, X_test, y_test)
RF2explainer.model_performance()
Preparation of a new explainer is initiated

  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000137E7081940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0888, mean = 0.412, max = 0.945
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.65, mean = 7.8e-05, max = 0.855
  -> model_info        : package sklearn

A new explainer has been created!
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning:

X does not have valid feature names, but RandomForestClassifier was fitted with feature names

Out[26]:
recall precision f1 accuracy auc
RandomForestClassifier 0.884211 0.976744 0.928177 0.943601 0.983842
In [27]:
pvi2 = RF2explainer.model_parts(random_state=0)
pvi2.plot(show=False).update_layout(autosize=False, width=600, height=450)

RF with fewer variables¶

In [28]:
RF_3 = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.1, random_state = 1).fit(X_train, y_train.values.ravel())
print("Train accuracy: ", accuracy_score(y_train, RF_3.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, RF_3.predict(X_test)))
In [29]:
RF3explainer = dx.Explainer(RF_3, X_test, y_test)
RF3explainer.model_performance()
Preparation of a new explainer is initiated

  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000137E7081940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00852, mean = 0.415, max = 0.996
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.637, mean = -0.003, max = 0.778
  -> model_info        : package sklearn

A new explainer has been created!
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning:

X does not have valid feature names, but RandomForestClassifier was fitted with feature names

Out[29]:
recall precision f1 accuracy auc
RandomForestClassifier 0.894737 0.982659 0.936639 0.950108 0.992795
In [30]:
pvi3 = RF3explainer.model_parts(random_state=0)
pvi3.plot(show=False).update_layout(autosize=False, width=600, height=450)

XGBOOST¶

In [17]:
import xgboost
In [18]:
model = xgboost.XGBClassifier(
    n_estimators=50,
    max_depth=2,
    eval_metric="logloss",
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X_train, y_train)
Out[18]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=True, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=50, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...)
In [20]:
def pf_xgboost_classifier_categorical(model, df):
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]
In [21]:
XGexplainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated

  -> data              : 461 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 461 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x00000137ECA9CE50> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 7.59e-05, mean = 0.42, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.793, mean = -0.00801, max = 0.759
  -> model_info        : package xgboost

A new explainer has been created!
In [22]:
XGpvi = XGexplainer.model_parts(random_state=0)
In [23]:
XGpvi.result
Out[23]:
variable dropout_loss label
0 word_freq_lab 0.003849 XGBClassifier
1 word_freq_technology 0.003981 XGBClassifier
2 word_freq_report 0.003997 XGBClassifier
3 word_freq_857 0.004001 XGBClassifier
4 word_freq_email 0.004001 XGBClassifier
5 word_freq_direct 0.004001 XGBClassifier
6 word_freq_data 0.004001 XGBClassifier
7 word_freq_people 0.004001 XGBClassifier
8 word_freq_mail 0.004001 XGBClassifier
9 word_freq_make 0.004001 XGBClassifier
10 word_freq_all 0.004001 XGBClassifier
11 word_freq_addresses 0.004001 XGBClassifier
12 word_freq_address 0.004001 XGBClassifier
13 word_freq_font 0.004001 XGBClassifier
14 word_freq_85 0.004001 XGBClassifier
15 word_freq_415 0.004001 XGBClassifier
16 word_freq_3d 0.004001 XGBClassifier
17 word_freq_receive 0.004001 XGBClassifier
18 word_freq_original 0.004001 XGBClassifier
19 char_freq_%5B 0.004001 XGBClassifier
20 char_freq_%3B 0.004001 XGBClassifier
21 word_freq_telnet 0.004001 XGBClassifier
22 word_freq_parts 0.004001 XGBClassifier
23 _full_model_ 0.004001 XGBClassifier
24 word_freq_table 0.004001 XGBClassifier
25 word_freq_labs 0.004001 XGBClassifier
26 word_freq_credit 0.004003 XGBClassifier
27 char_freq_%23 0.004018 XGBClassifier
28 word_freq_project 0.004024 XGBClassifier
29 word_freq_cs 0.004034 XGBClassifier
30 word_freq_order 0.004051 XGBClassifier
31 word_freq_hpl 0.004067 XGBClassifier
32 word_freq_pm 0.004073 XGBClassifier
33 word_freq_conference 0.004172 XGBClassifier
34 char_freq_%28 0.004211 XGBClassifier
35 word_freq_1999 0.004263 XGBClassifier
36 word_freq_will 0.004294 XGBClassifier
37 word_freq_over 0.004333 XGBClassifier
38 word_freq_your 0.004403 XGBClassifier
39 word_freq_internet 0.004479 XGBClassifier
40 word_freq_000 0.004512 XGBClassifier
41 word_freq_you 0.004682 XGBClassifier
42 word_freq_money 0.004747 XGBClassifier
43 word_freq_business 0.004748 XGBClassifier
44 word_freq_re 0.004954 XGBClassifier
45 word_freq_meeting 0.005331 XGBClassifier
46 word_freq_free 0.005690 XGBClassifier
47 word_freq_our 0.005928 XGBClassifier
48 word_freq_650 0.006205 XGBClassifier
49 word_freq_edu 0.007174 XGBClassifier
50 capital_run_length_average 0.009688 XGBClassifier
51 char_freq_%24 0.010285 XGBClassifier
52 char_freq_%21 0.011817 XGBClassifier
53 word_freq_remove 0.012744 XGBClassifier
54 word_freq_george 0.017075 XGBClassifier
55 word_freq_hp 0.022303 XGBClassifier
56 _baseline_ 0.492079 XGBClassifier
In [24]:
XGpvi.plot(show=False).update_layout(autosize=False, width=600, height=450)